Reducing query time of rainfall data using {duckdb} and {arrow} in R
Details
In this demonstration, we use gridded rainfall data from the National Oceanic and Atmospheric Administration (NOAA) to demonstrate the gain in run times and queries using local .duckdb files, the duckdb & arrow R packages.
Note
Some codes relating to connections and credentials have been removed owing to company policy. Only codes needed for demonstration are kept.
This document is prepared as a demonstration of work with database structure and spatial analysis for applying to Code With Us opportunity R Shiny interface for the climr package (downscaled monthly climate data)
All the results are real time
Procedure
The data is saved in two methods.
In Amazon AWS S3 buckets as .parquet files, partitioned by years.
This is how the data is saved.
A local .duckdb file that contains the data. This is a single file.
Results
We choose a min_date and a max_date and use them to query data from the two storage methods mentioned above. We will use an R package tictoc to show the run-time differences.
Show the code
get_daily_data =function(conn, date_min, date_max, years_selected, months_selected){ tictoc::tic() temp <- dplyr::tbl(conn, "daily_precip_data") |>#Need to have the Year Filter ==> because the data is partitioned with the Years, Otherwise it will get very slow. dplyr::filter(Year %in% years_selected) |> dplyr::filter(Month %in% months_selected) |> dplyr::filter(Date >= date_min, Date <= date_max) |> dplyr::group_by(GRIDCODE) |> dplyr::summarise(precip_sum =sum(Precip_daily) /10, #Keep in mind that we store everything as integer ==> need to divide by 10 to get data in mm!From =first(Date),To =last(Date) ) |> dplyr::collect() tictoc::toc()return(temp)}
Notice the run-time improvement with the local file implementation
A similar method may be used for Phase 2 - Database optimization of the project if duckdb is allowed. Otherwise, a similar method using PostGIS may be implemented
Map using mapgl
Finally, a map using {mapgl} to demonstrate my experience of working with climate data.
Show the code
tooltips =glue("<strong>{map_data$COUNTY_NAME}, {map_data$STATE_NAME}</strong> <br><strong>{map_data$precip_sum}mm</strong>")map_data =cbind(map_data, tooltips)colors =c("#fff", viridisLite::turbo(8, direction =-1)[4:8])map =maplibre(style =carto_style("positron")) %>%add_line_layer(id ="border",source = us_states,line_color ="#aaa",line_cap ="round" ) %>%fit_bounds(us_states, animate =FALSE, padding =60) %>%add_fill_layer(id ="precipitation_data",source = map_data, # The joined grid plot with precip_sum valuesfill_color =interpolate(column ="precip_sum", # The column in your dataset to base the colors onvalues =c(0,1,40, 100, 200, 400), # Precip_sum value breakpointsstops = colors, # Colors for each breakpointna_color ="transparent"# Use transparent for NA values or 0 values ),tooltip ="tooltips",hover_options =list(fill_color ="black",fill_opacity =1 ),fill_opacity =0.5# Adjust opacity for non-zero values as needed ) |>add_legend("Cumulated rainfall(mm)",values =c(0,1,40, 100, 200, 400),colors = colors,position ="bottom-right",width ="325px" ) |>add_symbol_layer(id ="state_labels",source = us_states,text_field ="{NAME}", # Correctly reference the 'NAME' field for state namestext_size =13, # Set font size for labelstext_color ="black",text_halo_width =1, # Add halo around text for readabilitytext_halo_color ="white"# Set halo color )map